Practice 1: Exploratory Data Analysis (EDA)
Introduction
In the banking sector, making sound decisions when evaluating loan applications is crucial to minimizing risk and maximizing profit. In this practice, we examine a dataset of loan applications using the Exploratory Data Analysis (EDA) techniques covered in class.
Objectives
- Identify patterns in the data that indicate applicants' capacity to meet their financial obligations, so that applicants able to repay the loan are not rejected, while profiles likely to struggle with the debt are detected.
- Answer the key question: is there a type of client more likely not to repay a loan? The answer guides the bank's decisions to mitigate exposure.
Notebook 1 development
Import libraries
In [1]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import plotly.express as px
import sys
sys.path.append('/Users/miguelflores/Desktop/P1/practica1')
from funciones import funciones_auxiliares as f_aux
pd.set_option("display.max_rows", 10000)
pd.set_option("display.max_columns", 10000)
pd.set_option("display.width", 10000)
Loading and reading the dataset
In [2]:
# Set 'SK_ID_CURR' as the index so that rows can be looked up directly by these identifiers.
df = pd.read_csv("/Users/miguelflores/Desktop/df_practica1.csv").set_index("SK_ID_CURR")
Note that an absolute path is used here because the dataset is too large to be hosted on GitHub. Otherwise, when file size is not an obstacle, the data file should be read from a relative path pointing into the GitHub repository.
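When the file is small enough to live in the repository, a relative path keeps the notebook portable across machines. The following is a minimal sketch under stated assumptions: the `data/` folder and the helper name `load_dataset` are hypothetical and are not part of the original project.

```python
from pathlib import Path

import pandas as pd


def load_dataset(csv_path, index_col="SK_ID_CURR"):
    """Read a CSV file and use `index_col` as the DataFrame index."""
    path = Path(csv_path)
    if not path.is_file():
        # Fail early with a clear message instead of a long pandas traceback.
        raise FileNotFoundError(f"Dataset not found: {path}")
    return pd.read_csv(path).set_index(index_col)


# Hypothetical relative location inside the repository:
# df = load_dataset(Path("data") / "df_practica1.csv")
```

Building the path with `pathlib` avoids hard-coding separators, so the same call should work on macOS, Linux and Windows.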
In [3]:
df.head()
Out[3]:
| TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | 
FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648.0 | -2120 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.083037 | 0.262949 | 0.139376 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | 0.0252 | 0.0383 | 0.9722 | 0.6341 | 0.0144 | 0.0000 | 0.0690 | 0.0833 | 0.1250 | 0.0377 | 0.022 | 0.0198 | 0.0 | 0.0 | 0.0250 | 0.0369 | 0.9722 | 0.6243 | 0.0144 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0375 | 0.0205 | 0.0193 | 0.0000 | 0.00 | reg oper account | block of flats | 0.0149 | Stone, brick | No | 2.0 | 2.0 | 2.0 | 2.0 | -1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.311267 | 0.622246 | NaN | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | 0.0924 | 0.0538 | 0.9851 | 0.8040 | 0.0497 | 0.0806 | 0.0345 | 0.2917 | 0.3333 | 0.0128 | 0.079 | 0.0554 | 0.0 | 0.0 | 0.0968 | 0.0529 | 0.9851 | 0.7987 | 0.0608 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0132 | 0.0787 | 0.0558 | 0.0039 | 0.01 | reg oper account | block of flats | 0.0714 | Block | No | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260.0 | -2531 | 26.0 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | NaN | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -3038 | -4311.0 | -3458 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.0 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | NaN | 0.322738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
General analysis of the table
Dimensions
In [4]:
print(df.shape, df.drop_duplicates().shape)
(307511, 121) (307511, 121)
Based on the output above, the DataFrame contains 121 variables and 307,511 observations. Because the shape is unchanged after dropping duplicates, it contains no duplicated rows.
Data types and their contents
To keep the code clean and work efficiently, helper functions are used from this section onward. They are collected in a separate module, which makes the code easier to reuse. Below, the function tipos_datos( ) is used to determine the type of each variable and display its contents.
In [5]:
f_aux.tipos_datos(df)
TARGET int64 Contenido: 1, 0
NAME_CONTRACT_TYPE object Contenido: Cash loans, Revolving loans
CODE_GENDER object Contenido: M, F, XNA
FLAG_OWN_CAR object Contenido: N, Y
FLAG_OWN_REALTY object Contenido: Y, N
CNT_CHILDREN int64 Contenido: 0, 1, 2, 3, 4, 7, 5, 6, 8, 9, 11, 12, 10, 19, 14
AMT_INCOME_TOTAL float64 Contenido: Más de 30 valores
AMT_CREDIT float64 Contenido: Más de 30 valores
AMT_ANNUITY float64 Contenido: Más de 30 valores
AMT_GOODS_PRICE float64 Contenido: Más de 30 valores
NAME_TYPE_SUITE object Contenido: Unaccompanied, Family, Spouse, partner, Children, Other_A, nan, Other_B, Group of people
NAME_INCOME_TYPE object Contenido: Working, State servant, Commercial associate, Pensioner, Unemployed, Student, Businessman, Maternity leave
NAME_EDUCATION_TYPE object Contenido: Secondary / secondary special, Higher education, Incomplete higher, Lower secondary, Academic degree
NAME_FAMILY_STATUS object Contenido: Single / not married, Married, Civil marriage, Widow, Separated, Unknown
NAME_HOUSING_TYPE object Contenido: House / apartment, Rented apartment, With parents, Municipal apartment, Office apartment, Co-op apartment
REGION_POPULATION_RELATIVE float64 Contenido: Más de 30 valores
DAYS_BIRTH int64 Contenido: Más de 30 valores
DAYS_EMPLOYED int64 Contenido: Más de 30 valores
DAYS_REGISTRATION float64 Contenido: Más de 30 valores
DAYS_ID_PUBLISH int64 Contenido: Más de 30 valores
OWN_CAR_AGE float64 Contenido: Más de 30 valores
FLAG_MOBIL int64 Contenido: 1, 0
FLAG_EMP_PHONE int64 Contenido: 1, 0
FLAG_WORK_PHONE int64 Contenido: 0, 1
FLAG_CONT_MOBILE int64 Contenido: 1, 0
FLAG_PHONE int64 Contenido: 1, 0
FLAG_EMAIL int64 Contenido: 0, 1
OCCUPATION_TYPE object Contenido: Laborers, Core staff, Accountants, Managers, nan, Drivers, Sales staff, Cleaning staff, Cooking staff, Private service staff, Medicine staff, Security staff, High skill tech staff, Waiters/barmen staff, Low-skill Laborers, Realty agents, Secretaries, IT staff, HR staff
CNT_FAM_MEMBERS float64 Contenido: 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 9.0, 7.0, 8.0, 10.0, 13.0, nan, 14.0, 12.0, 20.0, 15.0, 16.0, 11.0
REGION_RATING_CLIENT int64 Contenido: 2, 1, 3
REGION_RATING_CLIENT_W_CITY int64 Contenido: 2, 1, 3
WEEKDAY_APPR_PROCESS_START object Contenido: WEDNESDAY, MONDAY, THURSDAY, SUNDAY, SATURDAY, FRIDAY, TUESDAY
HOUR_APPR_PROCESS_START int64 Contenido: 10, 11, 9, 17, 16, 14, 8, 15, 7, 13, 6, 12, 19, 3, 18, 21, 4, 5, 20, 22, 1, 2, 23, 0
REG_REGION_NOT_LIVE_REGION int64 Contenido: 0, 1
REG_REGION_NOT_WORK_REGION int64 Contenido: 0, 1
LIVE_REGION_NOT_WORK_REGION int64 Contenido: 0, 1
REG_CITY_NOT_LIVE_CITY int64 Contenido: 0, 1
REG_CITY_NOT_WORK_CITY int64 Contenido: 0, 1
LIVE_CITY_NOT_WORK_CITY int64 Contenido: 0, 1
ORGANIZATION_TYPE object Contenido: Más de 30 valores
EXT_SOURCE_1 float64 Contenido: Más de 30 valores
EXT_SOURCE_2 float64 Contenido: Más de 30 valores
EXT_SOURCE_3 float64 Contenido: Más de 30 valores
APARTMENTS_AVG float64 Contenido: Más de 30 valores
BASEMENTAREA_AVG float64 Contenido: Más de 30 valores
YEARS_BEGINEXPLUATATION_AVG float64 Contenido: Más de 30 valores
YEARS_BUILD_AVG float64 Contenido: Más de 30 valores
COMMONAREA_AVG float64 Contenido: Más de 30 valores
ELEVATORS_AVG float64 Contenido: Más de 30 valores
ENTRANCES_AVG float64 Contenido: Más de 30 valores
FLOORSMAX_AVG float64 Contenido: Más de 30 valores
FLOORSMIN_AVG float64 Contenido: Más de 30 valores
LANDAREA_AVG float64 Contenido: Más de 30 valores
LIVINGAPARTMENTS_AVG float64 Contenido: Más de 30 valores
LIVINGAREA_AVG float64 Contenido: Más de 30 valores
NONLIVINGAPARTMENTS_AVG float64 Contenido: Más de 30 valores
NONLIVINGAREA_AVG float64 Contenido: Más de 30 valores
APARTMENTS_MODE float64 Contenido: Más de 30 valores
BASEMENTAREA_MODE float64 Contenido: Más de 30 valores
YEARS_BEGINEXPLUATATION_MODE float64 Contenido: Más de 30 valores
YEARS_BUILD_MODE float64 Contenido: Más de 30 valores
COMMONAREA_MODE float64 Contenido: Más de 30 valores
ELEVATORS_MODE float64 Contenido: 0.0, 0.0806, nan, 0.1611, 0.4028, 0.1208, 0.282, 0.0403, 0.2417, 0.8862, 0.3222, 0.3625, 0.6848, 0.5639, 0.6042, 0.2014, 0.5236, 0.4431, 0.4834, 0.6445, 0.725, 1.0, 0.8459, 0.9667, 0.8056, 0.9264, 0.7653
ENTRANCES_MODE float64 Contenido: Más de 30 valores
FLOORSMAX_MODE float64 Contenido: 0.0833, 0.2917, nan, 0.1667, 0.3333, 0.6667, 0.375, 0.0417, 0.25, 0.4583, 0.2083, 0.125, 0.0, 0.5833, 0.625, 0.9167, 0.9583, 0.5417, 1.0, 0.4167, 0.875, 0.7083, 0.75, 0.5, 0.7917, 0.8333
FLOORSMIN_MODE float64 Contenido: 0.125, 0.3333, nan, 0.375, 0.7083, 0.0417, 0.2083, 0.4167, 0.2917, 0.0, 0.5, 0.625, 0.0833, 0.1667, 0.6667, 0.25, 0.5833, 1.0, 0.9583, 0.5417, 0.9167, 0.75, 0.8333, 0.4583, 0.7917, 0.875
LANDAREA_MODE float64 Contenido: Más de 30 valores
LIVINGAPARTMENTS_MODE float64 Contenido: Más de 30 valores
LIVINGAREA_MODE float64 Contenido: Más de 30 valores
NONLIVINGAPARTMENTS_MODE float64 Contenido: Más de 30 valores
NONLIVINGAREA_MODE float64 Contenido: Más de 30 valores
APARTMENTS_MEDI float64 Contenido: Más de 30 valores
BASEMENTAREA_MEDI float64 Contenido: Más de 30 valores
YEARS_BEGINEXPLUATATION_MEDI float64 Contenido: Más de 30 valores
YEARS_BUILD_MEDI float64 Contenido: Más de 30 valores
COMMONAREA_MEDI float64 Contenido: Más de 30 valores
ELEVATORS_MEDI float64 Contenido: Más de 30 valores
ENTRANCES_MEDI float64 Contenido: Más de 30 valores
FLOORSMAX_MEDI float64 Contenido: Más de 30 valores
FLOORSMIN_MEDI float64 Contenido: Más de 30 valores
LANDAREA_MEDI float64 Contenido: Más de 30 valores
LIVINGAPARTMENTS_MEDI float64 Contenido: Más de 30 valores
LIVINGAREA_MEDI float64 Contenido: Más de 30 valores
NONLIVINGAPARTMENTS_MEDI float64 Contenido: Más de 30 valores
NONLIVINGAREA_MEDI float64 Contenido: Más de 30 valores
FONDKAPREMONT_MODE object Contenido: reg oper account, nan, org spec account, reg oper spec account, not specified
HOUSETYPE_MODE object Contenido: block of flats, nan, terraced house, specific housing
TOTALAREA_MODE float64 Contenido: Más de 30 valores
WALLSMATERIAL_MODE object Contenido: Stone, brick, Block, nan, Panel, Mixed, Wooden, Others, Monolithic
EMERGENCYSTATE_MODE object Contenido: No, nan, Yes
OBS_30_CNT_SOCIAL_CIRCLE float64 Contenido: Más de 30 valores
DEF_30_CNT_SOCIAL_CIRCLE float64 Contenido: 2.0, 0.0, 1.0, nan, 3.0, 4.0, 5.0, 6.0, 7.0, 34.0, 8.0
OBS_60_CNT_SOCIAL_CIRCLE float64 Contenido: Más de 30 valores
DEF_60_CNT_SOCIAL_CIRCLE float64 Contenido: 2.0, 0.0, 1.0, nan, 3.0, 5.0, 4.0, 7.0, 24.0, 6.0
DAYS_LAST_PHONE_CHANGE float64 Contenido: Más de 30 valores
FLAG_DOCUMENT_2 int64 Contenido: 0, 1
FLAG_DOCUMENT_3 int64 Contenido: 1, 0
FLAG_DOCUMENT_4 int64 Contenido: 0, 1
FLAG_DOCUMENT_5 int64 Contenido: 0, 1
FLAG_DOCUMENT_6 int64 Contenido: 0, 1
FLAG_DOCUMENT_7 int64 Contenido: 0, 1
FLAG_DOCUMENT_8 int64 Contenido: 0, 1
FLAG_DOCUMENT_9 int64 Contenido: 0, 1
FLAG_DOCUMENT_10 int64 Contenido: 0, 1
FLAG_DOCUMENT_11 int64 Contenido: 0, 1
FLAG_DOCUMENT_12 int64 Contenido: 0, 1
FLAG_DOCUMENT_13 int64 Contenido: 0, 1
FLAG_DOCUMENT_14 int64 Contenido: 0, 1
FLAG_DOCUMENT_15 int64 Contenido: 0, 1
FLAG_DOCUMENT_16 int64 Contenido: 0, 1
FLAG_DOCUMENT_17 int64 Contenido: 0, 1
FLAG_DOCUMENT_18 int64 Contenido: 0, 1
FLAG_DOCUMENT_19 int64 Contenido: 0, 1
FLAG_DOCUMENT_20 int64 Contenido: 0, 1
FLAG_DOCUMENT_21 int64 Contenido: 0, 1
AMT_REQ_CREDIT_BUREAU_HOUR float64 Contenido: 0.0, nan, 1.0, 2.0, 3.0, 4.0
AMT_REQ_CREDIT_BUREAU_DAY float64 Contenido: 0.0, nan, 1.0, 3.0, 2.0, 4.0, 5.0, 6.0, 9.0, 8.0
AMT_REQ_CREDIT_BUREAU_WEEK float64 Contenido: 0.0, nan, 1.0, 3.0, 2.0, 4.0, 5.0, 6.0, 8.0, 7.0
AMT_REQ_CREDIT_BUREAU_MON float64 Contenido: 0.0, nan, 1.0, 2.0, 6.0, 5.0, 3.0, 7.0, 9.0, 4.0, 11.0, 8.0, 16.0, 12.0, 14.0, 10.0, 13.0, 17.0, 24.0, 19.0, 15.0, 23.0, 18.0, 27.0, 22.0
AMT_REQ_CREDIT_BUREAU_QRT float64 Contenido: 0.0, nan, 1.0, 2.0, 4.0, 3.0, 8.0, 5.0, 6.0, 7.0, 261.0, 19.0
AMT_REQ_CREDIT_BUREAU_YEAR float64 Contenido: 1.0, 0.0, nan, 2.0, 4.0, 5.0, 3.0, 8.0, 6.0, 9.0, 7.0, 10.0, 11.0, 13.0, 16.0, 12.0, 25.0, 23.0, 15.0, 14.0, 22.0, 17.0, 19.0, 18.0, 21.0, 20.0
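The source of tipos_datos( ) is not shown in this notebook. As an illustration only, a helper producing a similar per-column summary could look like the sketch below. The name `describe_columns`, the `max_values` parameter and the cutoff of 30 are assumptions inferred from the "Más de 30 valores" message in the output, not the author's implementation.

```python
import pandas as pd


def describe_columns(df, max_values=30):
    """Return one summary line per column: name, dtype and distinct values."""
    lines = []
    for name in df.columns:
        # unique() keeps the order of first appearance and includes NaN.
        values = df[name].unique()
        if len(values) > max_values:
            content = f"more than {max_values} values"
        else:
            content = ", ".join(str(v) for v in values)
        lines.append(f"{name} {df[name].dtype} Content: {content}")
    return lines


example = pd.DataFrame({"TARGET": [1, 0, 0], "CODE_GENDER": ["M", "F", "M"]})
for line in describe_columns(example):
    print(line)
```

Capping the listing at a fixed number of distinct values keeps the output readable for continuous variables such as AMT_CREDIT while still exposing the full set of categories for flags and low-cardinality columns.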
Factors to consider regarding the variables used in the model
Reviewing the variables, we found that the dataset contains future-dependent variables that may affect the results obtained with the model: these specific variables are based on historical data, so they will not be available at the time the model is called.
- EXT_SOURCE_1
- EXT_SOURCE_2
- EXT_SOURCE_3
- OBS_30_CNT_SOCIAL_CIRCLE
- DEF_30_CNT_SOCIAL_CIRCLE
- OBS_60_CNT_SOCIAL_CIRCLE
- DEF_60_CNT_SOCIAL_CIRCLE
EXT_SOURCE_1, EXT_SOURCE_2 and EXT_SOURCE_3 are normalized scores from an external data source. OBS_30_CNT_SOCIAL_CIRCLE, DEF_30_CNT_SOCIAL_CIRCLE, OBS_60_CNT_SOCIAL_CIRCLE and DEF_60_CNT_SOCIAL_CIRCLE record payment defaults in the client's close social circle (30 or 60 days past due).
In this case we will not drop these columns from the DataFrame, because we assume that the client provides the data directly, with no need to resort to external sources or historical data.
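If that assumption were later revisited, the flagged columns could be separated from the rest in one step. The sketch below is hypothetical: the helper `split_leakage_columns` and the constant `POSSIBLE_LEAKAGE` are illustrative names, and the notebook itself keeps all of these columns.

```python
import pandas as pd

# Columns flagged in the text as depending on external or historical data.
POSSIBLE_LEAKAGE = [
    "EXT_SOURCE_1",
    "EXT_SOURCE_2",
    "EXT_SOURCE_3",
    "OBS_30_CNT_SOCIAL_CIRCLE",
    "DEF_30_CNT_SOCIAL_CIRCLE",
    "OBS_60_CNT_SOCIAL_CIRCLE",
    "DEF_60_CNT_SOCIAL_CIRCLE",
]


def split_leakage_columns(df, candidates=POSSIBLE_LEAKAGE):
    """Return the candidate columns present in df and a copy of df without them."""
    present = [name for name in candidates if name in df.columns]
    return present, df.drop(columns=present)


demo = pd.DataFrame(
    {"TARGET": [0, 1], "EXT_SOURCE_1": [0.3, 0.1], "AMT_CREDIT": [1.0, 2.0]}
)
present, reduced = split_leakage_columns(demo)
print(present)
print(list(reduced.columns))
```

Returning the reduced copy instead of modifying the input makes it straightforward to compare model performance with and without these variables.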
Exploring the target variable
In [6]:
# Share of each value of the target variable TARGET (%)
df_proporcion = df['TARGET']\
    .value_counts(normalize=True)\
    .mul(100).rename("%").reset_index()
# Count of each value of the variable
df_conteo = df['TARGET'].value_counts().reset_index()
# Combine the two DataFrames generated above
df_proporcion_conteo = pd.merge(df_proporcion, df_conteo, how='inner')
df_proporcion_conteo
Out[6]:
| | TARGET | % | count |
|---|---|---|---|
| 0 | 0 | 91.927118 | 282686 |
| 1 | 1 | 8.072882 | 24825 |
In [7]:
# Draw the bar chart
fig = px.bar(df_proporcion_conteo, x = "TARGET", y = "%",
             labels = {'TARGET': 'Target', '%': 'Percentage'},
             title = "Distribution of the target variable")
# Show the value for each TARGET class
fig.update_traces(text = df_proporcion_conteo['%'].round(2), # Round the figure to two decimals
                  textposition = 'inside',
                  texttemplate = '%{text}%', # Display the value as a percentage
                  )
# Update the x-axis, replacing the values 0 and 1 with the client types
fig.update_xaxes(tickmode = 'array', tickvals = [0, 1],
                 ticktext = ['Type 0: Client without payment difficulties', 'Type 1: Client with payment difficulties'])
fig.update_yaxes(title_text="Percentage")
fig.show()
From the chart above, the probability that a randomly drawn observation corresponds to a client with payment difficulties is 8.07%.
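The percentage can be checked directly from the counts reported in the table of Out[6]. The short calculation below also derives the ratio between the two classes, a figure the original text does not state but which follows from the same counts and quantifies how imbalanced the target is.

```python
# Counts reported in the Out[6] table.
n_without_difficulties = 282_686  # TARGET == 0
n_with_difficulties = 24_825      # TARGET == 1
total = n_without_difficulties + n_with_difficulties

# Share of clients with payment difficulties, as a percentage.
difficulty_rate = 100 * n_with_difficulties / total
# Number of clients without difficulties per client with difficulties.
imbalance_ratio = n_without_difficulties / n_with_difficulties

print(f"Total observations: {total}")
print(f"Clients with payment difficulties: {difficulty_rate:.2f}%")
print(f"Imbalance ratio: about {imbalance_ratio:.1f} to 1")
```

An imbalance of this size means that plain accuracy would be a misleading metric for any model trained on TARGET, since always predicting class 0 would already be right about 92% of the time.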
Selecting row and column thresholds to remove missing values
Below, the function nulos_columna( ) returns the number of nulls per column and the corresponding percentage.
In [8]:
f_aux.nulos_columna(df)
Out[8]:
| | nulos_columnas | porcentaje_columnas |
|---|---|---|
| COMMONAREA_AVG | 214865 | 69.872297 |
| COMMONAREA_MODE | 214865 | 69.872297 |
| COMMONAREA_MEDI | 214865 | 69.872297 |
| NONLIVINGAPARTMENTS_AVG | 213514 | 69.432963 |
| NONLIVINGAPARTMENTS_MODE | 213514 | 69.432963 |
| NONLIVINGAPARTMENTS_MEDI | 213514 | 69.432963 |
| FONDKAPREMONT_MODE | 210295 | 68.386172 |
| LIVINGAPARTMENTS_MEDI | 210199 | 68.354953 |
| LIVINGAPARTMENTS_AVG | 210199 | 68.354953 |
| LIVINGAPARTMENTS_MODE | 210199 | 68.354953 |
| FLOORSMIN_AVG | 208642 | 67.848630 |
| FLOORSMIN_MODE | 208642 | 67.848630 |
| FLOORSMIN_MEDI | 208642 | 67.848630 |
| YEARS_BUILD_AVG | 204488 | 66.497784 |
| YEARS_BUILD_MEDI | 204488 | 66.497784 |
| YEARS_BUILD_MODE | 204488 | 66.497784 |
| OWN_CAR_AGE | 202929 | 65.990810 |
| LANDAREA_MEDI | 182590 | 59.376738 |
| LANDAREA_AVG | 182590 | 59.376738 |
| LANDAREA_MODE | 182590 | 59.376738 |
| BASEMENTAREA_MEDI | 179943 | 58.515956 |
| BASEMENTAREA_AVG | 179943 | 58.515956 |
| BASEMENTAREA_MODE | 179943 | 58.515956 |
| EXT_SOURCE_1 | 173378 | 56.381073 |
| NONLIVINGAREA_AVG | 169682 | 55.179164 |
| NONLIVINGAREA_MODE | 169682 | 55.179164 |
| NONLIVINGAREA_MEDI | 169682 | 55.179164 |
| ELEVATORS_MODE | 163891 | 53.295980 |
| ELEVATORS_AVG | 163891 | 53.295980 |
| ELEVATORS_MEDI | 163891 | 53.295980 |
| WALLSMATERIAL_MODE | 156341 | 50.840783 |
| APARTMENTS_AVG | 156061 | 50.749729 |
| APARTMENTS_MEDI | 156061 | 50.749729 |
| APARTMENTS_MODE | 156061 | 50.749729 |
| ENTRANCES_AVG | 154828 | 50.348768 |
| ENTRANCES_MEDI | 154828 | 50.348768 |
| ENTRANCES_MODE | 154828 | 50.348768 |
| LIVINGAREA_MEDI | 154350 | 50.193326 |
| LIVINGAREA_MODE | 154350 | 50.193326 |
| LIVINGAREA_AVG | 154350 | 50.193326 |
| HOUSETYPE_MODE | 154297 | 50.176091 |
| FLOORSMAX_MEDI | 153020 | 49.760822 |
| FLOORSMAX_MODE | 153020 | 49.760822 |
| FLOORSMAX_AVG | 153020 | 49.760822 |
| YEARS_BEGINEXPLUATATION_MEDI | 150007 | 48.781019 |
| YEARS_BEGINEXPLUATATION_MODE | 150007 | 48.781019 |
| YEARS_BEGINEXPLUATATION_AVG | 150007 | 48.781019 |
| TOTALAREA_MODE | 148431 | 48.268517 |
| EMERGENCYSTATE_MODE | 145755 | 47.398304 |
| OCCUPATION_TYPE | 96391 | 31.345545 |
| EXT_SOURCE_3 | 60965 | 19.825307 |
| AMT_REQ_CREDIT_BUREAU_WEEK | 41519 | 13.501631 |
| AMT_REQ_CREDIT_BUREAU_HOUR | 41519 | 13.501631 |
| AMT_REQ_CREDIT_BUREAU_MON | 41519 | 13.501631 |
| AMT_REQ_CREDIT_BUREAU_QRT | 41519 | 13.501631 |
| AMT_REQ_CREDIT_BUREAU_DAY | 41519 | 13.501631 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 41519 | 13.501631 |
| NAME_TYPE_SUITE | 1292 | 0.420148 |
| DEF_30_CNT_SOCIAL_CIRCLE | 1021 | 0.332021 |
| OBS_60_CNT_SOCIAL_CIRCLE | 1021 | 0.332021 |
| OBS_30_CNT_SOCIAL_CIRCLE | 1021 | 0.332021 |
| DEF_60_CNT_SOCIAL_CIRCLE | 1021 | 0.332021 |
| EXT_SOURCE_2 | 660 | 0.214626 |
| AMT_GOODS_PRICE | 278 | 0.090403 |
| AMT_ANNUITY | 12 | 0.003902 |
| CNT_FAM_MEMBERS | 2 | 0.000650 |
| DAYS_LAST_PHONE_CHANGE | 1 | 0.000325 |
| AMT_INCOME_TOTAL | 0 | 0.000000 |
| FLAG_DOCUMENT_8 | 0 | 0.000000 |
| CODE_GENDER | 0 | 0.000000 |
| FLAG_OWN_CAR | 0 | 0.000000 |
| FLAG_OWN_REALTY | 0 | 0.000000 |
| FLAG_DOCUMENT_2 | 0 | 0.000000 |
| FLAG_DOCUMENT_3 | 0 | 0.000000 |
| FLAG_DOCUMENT_4 | 0 | 0.000000 |
| FLAG_DOCUMENT_5 | 0 | 0.000000 |
| FLAG_DOCUMENT_6 | 0 | 0.000000 |
| FLAG_DOCUMENT_7 | 0 | 0.000000 |
| FLAG_DOCUMENT_9 | 0 | 0.000000 |
| FLAG_DOCUMENT_21 | 0 | 0.000000 |
| FLAG_DOCUMENT_10 | 0 | 0.000000 |
| FLAG_DOCUMENT_11 | 0 | 0.000000 |
| CNT_CHILDREN | 0 | 0.000000 |
| FLAG_DOCUMENT_13 | 0 | 0.000000 |
| FLAG_DOCUMENT_14 | 0 | 0.000000 |
| FLAG_DOCUMENT_15 | 0 | 0.000000 |
| FLAG_DOCUMENT_16 | 0 | 0.000000 |
| FLAG_DOCUMENT_17 | 0 | 0.000000 |
| FLAG_DOCUMENT_18 | 0 | 0.000000 |
| FLAG_DOCUMENT_19 | 0 | 0.000000 |
| FLAG_DOCUMENT_20 | 0 | 0.000000 |
| FLAG_DOCUMENT_12 | 0 | 0.000000 |
| AMT_CREDIT | 0 | 0.000000 |
| ORGANIZATION_TYPE | 0 | 0.000000 |
| NAME_INCOME_TYPE | 0 | 0.000000 |
| LIVE_CITY_NOT_WORK_CITY | 0 | 0.000000 |
| NAME_CONTRACT_TYPE | 0 | 0.000000 |
| REG_CITY_NOT_WORK_CITY | 0 | 0.000000 |
| REG_CITY_NOT_LIVE_CITY | 0 | 0.000000 |
| LIVE_REGION_NOT_WORK_REGION | 0 | 0.000000 |
| REG_REGION_NOT_WORK_REGION | 0 | 0.000000 |
| REG_REGION_NOT_LIVE_REGION | 0 | 0.000000 |
| HOUR_APPR_PROCESS_START | 0 | 0.000000 |
| WEEKDAY_APPR_PROCESS_START | 0 | 0.000000 |
| REGION_RATING_CLIENT_W_CITY | 0 | 0.000000 |
| REGION_RATING_CLIENT | 0 | 0.000000 |
| FLAG_EMAIL | 0 | 0.000000 |
| FLAG_PHONE | 0 | 0.000000 |
| FLAG_CONT_MOBILE | 0 | 0.000000 |
| FLAG_WORK_PHONE | 0 | 0.000000 |
| FLAG_EMP_PHONE | 0 | 0.000000 |
| FLAG_MOBIL | 0 | 0.000000 |
| DAYS_ID_PUBLISH | 0 | 0.000000 |
| DAYS_REGISTRATION | 0 | 0.000000 |
| DAYS_EMPLOYED | 0 | 0.000000 |
| DAYS_BIRTH | 0 | 0.000000 |
| REGION_POPULATION_RELATIVE | 0 | 0.000000 |
| NAME_HOUSING_TYPE | 0 | 0.000000 |
| NAME_FAMILY_STATUS | 0 | 0.000000 |
| NAME_EDUCATION_TYPE | 0 | 0.000000 |
| TARGET | 0 | 0.000000 |
Based on the proportion of null values in each variable (the threshold), we chose to keep all variables, because we consider it necessary to examine their relevance further by checking how they perform with the model.
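For reference, a null summary of this kind and a threshold filter can be sketched as follows. The helper names `null_summary` and `columns_above_threshold` and the 50% cutoff in the demo are assumptions for illustration; the notebook's own nulos_columna( ) is not shown, and no threshold is actually applied in this practice.

```python
import pandas as pd


def null_summary(df):
    """Count nulls per column and express them as a percentage of the rows."""
    nulls = df.isnull().sum()
    summary = pd.DataFrame(
        {"null_count": nulls, "null_pct": 100 * nulls / len(df)}
    )
    return summary.sort_values("null_count", ascending=False)


def columns_above_threshold(df, threshold_pct):
    """List the columns whose share of nulls exceeds threshold_pct."""
    summary = null_summary(df)
    return summary.index[summary["null_pct"] > threshold_pct].tolist()


demo = pd.DataFrame(
    {
        "a": [1.0, None, None, None],  # 75% null
        "b": [1, 2, 3, 4],             # 0% null
        "c": [None, 1.0, 2.0, 3.0],    # 25% null
    }
)
print(columns_above_threshold(demo, threshold_pct=50))
```

Listing the candidates rather than dropping them immediately fits the decision taken here: the columns stay in the DataFrame, and the list can be reused later to test whether removing them changes model performance.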
Initial preprocessing of some variables
Here, a categorical variable stored as text is converted to a numeric representation, to make it easier to process and handle more efficiently in the analysis.
In [9]:
dia = { "MONDAY": 1, "TUESDAY": 2, "WEDNESDAY": 3, "THURSDAY": 4, "FRIDAY": 5, "SATURDAY": 6, "SUNDAY": 7}
df['NWEEKDAY_PROCESS_START'] = df['WEEKDAY_APPR_PROCESS_START'].replace(dia)
/var/folders/bj/jtm72vws3zncs7grnzmn95yh0000gn/T/ipykernel_12087/367575524.py:3: FutureWarning:
Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
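The FutureWarning comes from `replace` changing the column from text to integers. One alternative, sketched here on a small stand-in Series rather than on the real DataFrame, is `Series.map`, which builds a new Series from the dictionary lookup and should not go through that downcasting path. The variable names are illustrative.

```python
import pandas as pd

day_number = {
    "MONDAY": 1, "TUESDAY": 2, "WEDNESDAY": 3, "THURSDAY": 4,
    "FRIDAY": 5, "SATURDAY": 6, "SUNDAY": 7,
}

# Small stand-in for df['WEEKDAY_APPR_PROCESS_START'].
weekdays = pd.Series(
    ["WEDNESDAY", "MONDAY", "SUNDAY"], name="WEEKDAY_APPR_PROCESS_START"
)

# map() looks each value up in the dictionary; values missing from it become NaN.
encoded = weekdays.map(day_number)
print(encoded.tolist())
```

One behavioral difference is worth keeping in mind: `replace` leaves unmatched values untouched, whereas `map` turns them into NaN, so `map` is the safer choice only when the dictionary covers every category, as it does for the seven weekdays.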
In [10]:
# The new variable was added to the DataFrame in the previous cell; here we verify that it was created.
nueva_columna = 'NWEEKDAY_PROCESS_START' in df.columns
nueva_columna
Out[10]:
True
In [11]:
# Drop the original column now that it has been encoded, then verify that it was removed from the DataFrame.
df.drop("WEEKDAY_APPR_PROCESS_START", axis=1, inplace=True)
columna_existe = 'WEEKDAY_APPR_PROCESS_START' in df.columns
columna_existe
Out[11]:
False
Assuming there are boolean variables stored as text strings, such as 'YES', 'NO', 'y', 'n', or any other variation, we define the function valores_booleanos( ), which replaces these values with 1 for affirmative cases and 0 for negative ones.
In [12]:
f_aux.valores_booleanos(df)
/Users/miguelflores/Desktop/P1/practica1/funciones/funciones_auxiliares.py:82: FutureWarning:
Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
Out[12]:
| TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | 
FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | NWEEKDAY_PROCESS_START | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 100002 | 1 | Cash loans | M | 0 | 1 | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648.0 | -2120 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.083037 | 0.262949 | 0.139376 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | 0.0252 | 0.0383 | 0.9722 | 0.6341 | 0.0144 | 0.0000 | 0.0690 | 0.0833 | 0.1250 | 0.0377 | 0.0220 | 0.0198 | 0.0 | 0.0000 | 0.0250 | 0.0369 | 0.9722 | 0.6243 | 0.0144 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0375 | 0.0205 | 0.0193 | 0.0000 | 0.0000 | reg oper account | block of flats | 0.0149 | Stone, brick | No | 2.0 | 2.0 | 2.0 | 2.0 | -1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3 |
| 100003 | 0 | Cash loans | F | 0 | 0 | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.0 | 1 | 1 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.311267 | 0.622246 | NaN | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | 0.0924 | 0.0538 | 0.9851 | 0.8040 | 0.0497 | 0.0806 | 0.0345 | 0.2917 | 0.3333 | 0.0128 | 0.0790 | 0.0554 | 0.0 | 0.0000 | 0.0968 | 0.0529 | 0.9851 | 0.7987 | 0.0608 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0132 | 0.0787 | 0.0558 | 0.0039 | 0.0100 | reg oper account | block of flats | 0.0714 | Block | No | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 100004 | 0 | Revolving loans | M | 1 | 1 | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260.0 | -2531 | 26.0 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | NaN | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 100006 | 0 | Cash loans | F | 0 | 1 | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 3 |
| 100007 | 0 | Cash loans | M | 0 | 1 | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -3038 | -4311.0 | -3458 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.0 | 2 | 2 | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | NaN | 0.322738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 456251 | 0 | Cash loans | M | 0 | 0 | 0 | 157500.0 | 254700.0 | 27558.0 | 225000.0 | Unaccompanied | Working | Secondary / secondary special | Separated | With parents | 0.032561 | -9327 | -236 | -8456.0 | -1982 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Sales staff | 1.0 | 1 | 1 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | Services | 0.145570 | 0.681632 | NaN | 0.2021 | 0.0887 | 0.9876 | 0.8300 | 0.0202 | 0.22 | 0.1034 | 0.6042 | 0.2708 | 0.0594 | 0.1484 | 0.1965 | 0.0753 | 0.1095 | 0.1008 | 0.0172 | 0.9782 | 0.7125 | 0.0172 | 0.0806 | 0.0345 | 0.4583 | 0.0417 | 0.0094 | 0.0882 | 0.0853 | 0.0 | 0.0125 | 0.2040 | 0.0887 | 0.9876 | 0.8323 | 0.0203 | 0.22 | 0.1034 | 0.6042 | 0.2708 | 0.0605 | 0.1509 | 0.2001 | 0.0757 | 0.1118 | reg oper account | block of flats | 0.2898 | Stone, brick | No | 0.0 | 0.0 | 0.0 | 0.0 | -273.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 4 |
| 456252 | 0 | Cash loans | F | 0 | 1 | 0 | 72000.0 | 269550.0 | 12001.5 | 225000.0 | Unaccompanied | Pensioner | Secondary / secondary special | Widow | House / apartment | 0.025164 | -20775 | 365243 | -4388.0 | -4090 | NaN | 1 | 0 | 0 | 1 | 1 | 0 | NaN | 1.0 | 2 | 2 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | NaN | 0.115992 | NaN | 0.0247 | 0.0435 | 0.9727 | 0.6260 | 0.0022 | 0.00 | 0.1034 | 0.0833 | 0.1250 | 0.0579 | 0.0202 | 0.0257 | 0.0000 | 0.0000 | 0.0252 | 0.0451 | 0.9727 | 0.6406 | 0.0022 | 0.0000 | 0.1034 | 0.0833 | 0.1250 | 0.0592 | 0.0220 | 0.0267 | 0.0 | 0.0000 | 0.0250 | 0.0435 | 0.9727 | 0.6310 | 0.0022 | 0.00 | 0.1034 | 0.0833 | 0.1250 | 0.0589 | 0.0205 | 0.0261 | 0.0000 | 0.0000 | reg oper account | block of flats | 0.0214 | Stone, brick | No | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
| 456253 | 0 | Cash loans | F | 0 | 1 | 0 | 153000.0 | 677664.0 | 29979.0 | 585000.0 | Unaccompanied | Working | Higher education | Separated | House / apartment | 0.005002 | -14966 | -7921 | -6737.0 | -5150 | NaN | 1 | 1 | 0 | 1 | 0 | 1 | Managers | 1.0 | 3 | 3 | 9 | 0 | 0 | 0 | 0 | 1 | 1 | School | 0.744026 | 0.535722 | 0.218859 | 0.1031 | 0.0862 | 0.9816 | 0.7484 | 0.0123 | 0.00 | 0.2069 | 0.1667 | 0.2083 | NaN | 0.0841 | 0.9279 | 0.0000 | 0.0000 | 0.1050 | 0.0894 | 0.9816 | 0.7583 | 0.0124 | 0.0000 | 0.2069 | 0.1667 | 0.2083 | NaN | 0.0918 | 0.9667 | 0.0 | 0.0000 | 0.1041 | 0.0862 | 0.9816 | 0.7518 | 0.0124 | 0.00 | 0.2069 | 0.1667 | 0.2083 | NaN | 0.0855 | 0.9445 | 0.0000 | 0.0000 | reg oper account | block of flats | 0.7970 | Panel | No | 6.0 | 0.0 | 6.0 | 0.0 | -1909.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 4 |
| 456254 | 1 | Cash loans | F | 0 | 1 | 0 | 171000.0 | 370107.0 | 20205.0 | 319500.0 | Unaccompanied | Commercial associate | Secondary / secondary special | Married | House / apartment | 0.005313 | -11961 | -4786 | -2562.0 | -931 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | 9 | 0 | 0 | 0 | 1 | 1 | 0 | Business Entity Type 1 | NaN | 0.514163 | 0.661024 | 0.0124 | NaN | 0.9771 | NaN | NaN | NaN | 0.0690 | 0.0417 | NaN | NaN | NaN | 0.0061 | NaN | NaN | 0.0126 | NaN | 0.9772 | NaN | NaN | NaN | 0.0690 | 0.0417 | NaN | NaN | NaN | 0.0063 | NaN | NaN | 0.0125 | NaN | 0.9771 | NaN | NaN | NaN | 0.0690 | 0.0417 | NaN | NaN | NaN | 0.0062 | NaN | NaN | NaN | block of flats | 0.0086 | Stone, brick | No | 0.0 | 0.0 | 0.0 | 0.0 | -322.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 |
| 456255 | 0 | Cash loans | F | 0 | 0 | 0 | 157500.0 | 675000.0 | 49117.5 | 675000.0 | Unaccompanied | Commercial associate | Higher education | Married | House / apartment | 0.046220 | -16856 | -1262 | -5128.0 | -410 | NaN | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 2.0 | 1 | 1 | 20 | 0 | 0 | 0 | 0 | 1 | 1 | Business Entity Type 3 | 0.734460 | 0.708569 | 0.113922 | 0.0742 | 0.0526 | 0.9881 | NaN | 0.0176 | 0.08 | 0.0690 | 0.3750 | NaN | NaN | NaN | 0.0791 | NaN | 0.0000 | 0.0756 | 0.0546 | 0.9881 | NaN | 0.0178 | 0.0806 | 0.0690 | 0.3750 | NaN | NaN | NaN | 0.0824 | NaN | 0.0000 | 0.0749 | 0.0526 | 0.9881 | NaN | 0.0177 | 0.08 | 0.0690 | 0.3750 | NaN | NaN | NaN | 0.0805 | NaN | 0.0000 | NaN | block of flats | 0.0718 | Panel | No | 0.0 | 0.0 | 0.0 | 0.0 | -787.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 4 |
307511 rows × 121 columns
Variable types: Categorical, Continuous, and Boolean¶
Using the function clasificar_variables(), we determine the type of each variable. The function also returns a list of unclassified variables for those that do not fit any of the other categories.¶
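As a rough illustration of how a classification of this kind can work, the sketch below sorts columns by dtype and by number of distinct values. The function name classify_columns, its rules, and the demo data are assumptions made for this example. It is not the implementation of f_aux.clasificar_variables and will not necessarily reproduce its results.

```python
import pandas as pd


def classify_columns(df: pd.DataFrame):
    """Hypothetical sketch, not the actual f_aux.clasificar_variables."""
    booleans, categoricals, continuous, unclassified = [], [], [], []
    for col in df.columns:
        series = df[col]
        n_unique = series.nunique()  # NaN values are ignored by default
        if n_unique <= 2:
            # At most two distinct values: treat as boolean (assumed rule)
            booleans.append(col)
        elif not pd.api.types.is_numeric_dtype(series):
            # Text-like columns with more than two values
            categoricals.append(col)
        elif pd.api.types.is_float_dtype(series):
            continuous.append(col)
        else:
            # Integer columns with more than two values need a manual decision
            unclassified.append(col)
    return booleans, categoricals, continuous, unclassified


# Small made-up frame, used only to exercise the sketch
demo = pd.DataFrame(
    {
        "flag": [0, 1, 0, 1],
        "kind": ["a", "b", "c", "a"],
        "amount": [1.5, 2.5, 3.5, 4.5],
        "count": [0, 1, 2, 3],
    }
)
demo_bool, demo_cat, demo_con, demo_rest = classify_columns(demo)
```

Integer columns with more than two values are left unclassified on purpose in this sketch, since they can be either counts treated as categories or genuine numeric measurements.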
In [13]:
f_aux.clasificar_variables(df)
Variables Booleanas: 36 ['TARGET', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EMERGENCYSTATE_MODE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21'] ============================================================================================================================================================================ Variables Categóricas: 14 ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE'] ============================================================================================================================================================================ Variables Continuas: 65 ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 
'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'] ============================================================================================================================================================================ Variables no clasificadas: 6 ['CNT_CHILDREN', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START', 'NWEEKDAY_PROCESS_START']
Out[13]:
(['TARGET', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EMERGENCYSTATE_MODE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21'], ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE'], ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 
'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'], ['CNT_CHILDREN', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START', 'NWEEKDAY_PROCESS_START'])
In [14]:
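# The function returns the lists in the order (boolean, categorical, continuous, unclassified), as Out[13] shows.
lista_var_bool, lista_var_cat, lista_var_con, lista_var_no_clasificadas = f_aux.clasificar_variables(df)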
Based on the unclassified variables, we wrote the function nueva_clasificar_variables(), which follows the same format as the previous function but adds a loop that iterates over the unclassified variables and assigns each one to the appropriate variable type.¶
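A loop of this kind could look roughly like the sketch below. The function name reassign_unclassified, the max_levels threshold, and the demo data are assumptions made for illustration. The rule actually used inside f_aux.nueva_clasificar_variables is not shown in this notebook and may differ.

```python
import pandas as pd


def reassign_unclassified(df, categoricals, continuous, unclassified, max_levels=20):
    """Hypothetical sketch: move every unclassified column into one of the lists."""
    # Copy the inputs so the caller's lists are not modified in place
    categoricals = list(categoricals)
    continuous = list(continuous)
    for col in unclassified:
        # Assumed rule: few distinct values -> categorical, otherwise continuous
        if df[col].nunique() <= max_levels:
            categoricals.append(col)
        else:
            continuous.append(col)
    # Nothing is left unclassified after the loop
    return categoricals, continuous, []


# Made-up data, used only to exercise the sketch
demo = pd.DataFrame(
    {
        "children": [0, 1, 2] * 10,       # 3 distinct values
        "days": list(range(-30, 0)),      # 30 distinct values
    }
)
new_cat, new_con, new_rest = reassign_unclassified(
    demo, ["kind"], ["amount"], ["children", "days"]
)
```

A threshold on the number of distinct values is only one possible criterion. A fixed mapping from column name to type would be an equally valid choice when there are just a handful of columns to decide.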
In [15]:
f_aux.nueva_clasificar_variables(df)
Variables Booleanas: 36 ['TARGET', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EMERGENCYSTATE_MODE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21'] ============================================================================================================================================================================ Variables Categóricas: 16 ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'CNT_CHILDREN', 'NWEEKDAY_PROCESS_START'] ============================================================================================================================================================================ Variables Continuas: 69 ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 
'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START'] ============================================================================================================================================================================ Variables no clasificadas: 0 []
Out[15]:
(['TARGET', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EMERGENCYSTATE_MODE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21'], ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'CNT_CHILDREN', 'NWEEKDAY_PROCESS_START'], ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 
'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'HOUR_APPR_PROCESS_START'], [])
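The counts printed above add up to 36 + 16 + 69 = 121, which matches the 121 columns of the DataFrame. A check like the following sketch can confirm that the lists cover every column exactly once. The helper name check_partition is an assumption for this example and is not part of funciones_auxiliares.

```python
def check_partition(columns, *groups):
    """Return (missing, duplicated) column names for the given groups."""
    seen = set()
    duplicated = set()
    for group in groups:
        for col in group:
            if col in seen:
                # The column already appeared in this or an earlier group
                duplicated.add(col)
            seen.add(col)
    # Columns that were never assigned to any group
    missing = set(columns) - seen
    return missing, duplicated


# Made-up example: "c" is never assigned, nothing is duplicated
missing, duplicated = check_partition(["a", "b", "c"], ["a"], ["b"])
```

In the notebook it could be called with df.columns and the lists returned by nueva_clasificar_variables; two empty sets would mean the classification is a clean partition.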
Save the new CSV¶
In [16]:
df.to_csv('/Users/miguelflores/Desktop/CSV/pd_data_initial_preprocessing.csv')
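Because SK_ID_CURR is the index, to_csv writes it as the first column by default, so it has to be restored as the index when the file is read again. The sketch below shows the round trip on a small made-up frame written to a temporary directory. The demo values and the temporary path are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up two-row frame indexed the same way as the notebook's DataFrame
demo = pd.DataFrame(
    {"SK_ID_CURR": [100002, 100003], "TARGET": [1, 0]}
).set_index("SK_ID_CURR")

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "pd_data_initial_preprocessing.csv"
    # index=True is the default, so SK_ID_CURR is saved as the first column
    demo.to_csv(out_path)
    # Restore the identifier as the index when loading the file again
    restored = pd.read_csv(out_path, index_col="SK_ID_CURR")
```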